GH-49340: [R] Preserve row order in write_dataset()#49343

Merged
jonkeane merged 5 commits into apache:main from marberts:preserve_order
Feb 24, 2026

marberts (Contributor) commented Feb 20, 2026

Rationale for this change

write_dataset(df) is not guaranteed to preserve the row order of df when writing across partitions. The Arrow C++ library was recently updated (since 21.0.0) so that row order can be preserved when writing across partitions. This is useful when downstream code assumes that row order is unchanged within each partition.

df <- tibble::tibble(x = 1:1.5e6, g = rep(1:15, each = 1e5))

df |>
  dplyr::group_by(g) |>
  arrow::write_dataset("test1", preserve_order = FALSE)

df |>
  dplyr::group_by(g) |>
  arrow::write_dataset("test2", preserve_order = TRUE)

test1 <- arrow::open_dataset("test1") |>
  dplyr::collect()

test2 <- arrow::open_dataset("test2") |>
  dplyr::collect()

# Current behavior.
all.equal(test1 |> sort_by(~ g), df)
#> [1] "Component \"x\": Mean relative difference: 0.0475804"

# Preserve order.
all.equal(test2 |> sort_by(~ g), df)
#> [1] TRUE

Created on 2026-02-20 with reprex v2.1.1

What changes are included in this PR?

Added an argument preserve_order to write_dataset() that sets FileSystemDatasetWriteOptions.preserve_order to true in the call to ExecPlan_Write().

Are these changes tested?

Partially. The change is small, so I haven't written unit tests. I can revisit this if necessary.

Are there any user-facing changes?

Yes, there is a new argument in write_dataset(). The default keeps the current behavior and the argument appears after all existing arguments, so the change is backwards compatible.

github-actions (bot) commented

⚠️ GitHub issue #49340 has been automatically assigned in GitHub to PR creator.

jonkeane (Member) left a comment

Thanks for the contribution!

Would you mind please writing some tests for this behavior? Somewhere in https://github.com/apache/arrow/blob/main/r/tests/testthat/test-dataset-write.R (+ following similar patterns there) would be lovely.

jonkeane (Member) left a comment

Thanks for the tests, I have some suggestions about naming + slightly more idiomatic expectations.

It also looks like there are some C++ linting issues: https://github.com/apache/arrow/actions/runs/22290480080/job/64535896409?pr=49343#step:6:42

github-actions bot added the "awaiting committer review" label and removed the "awaiting review" label on Feb 24, 2026
jonkeane (Member) left a comment

Thanks for this! One question, but otherwise looks good.

Comment on lines +1035 to +1036
# expect_false(all(unordered_ds$x == df$x)) can fail on certain
# platforms, so is not tested.

Oh, interesting: which platforms? I can also imagine that there are certain circumstances where it will happen to wind up in the right order by chance too. It would be good to know where this failed though.

jonkeane merged commit 376afb8 into apache:main on Feb 24, 2026
16 checks passed
jonkeane removed the "awaiting committer review" label on Feb 24, 2026

marberts (Contributor, Author) commented Feb 24, 2026

Awesome, thanks!

To answer your question, it seems that write_dataset() behaves differently on Windows and Ubuntu. In some cases (but not always) row order is kept on Windows when preserve_order = FALSE. We can see this from the CI job that ran when I had the test expect_false(all(unordered_ds$x == df$x)): it passed on Ubuntu and Windows with an older version of R, but failed on Windows with a newer version of R (https://github.com/apache/arrow/actions/runs/22290480076/job/64535896168). I could reproduce this on my machine with R 4.5.2: the same call to write_dataset() did not preserve order on Ubuntu but did preserve order on Windows.
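
The testing approach discussed in this thread could be sketched roughly as follows: assert row order only when preserve_order = TRUE, and make no assertion about preserve_order = FALSE, since the unordered write may keep the original order by chance on some platforms. This is an illustrative sketch, not the test that was merged. The helper name order_is_preserved() and its parameters are hypothetical, and the code assumes an arrow build that includes the preserve_order argument from this PR (it returns NA otherwise).

```r
# Sketch only. Assumes an arrow build that includes the `preserve_order`
# argument added in this PR; `order_is_preserved()` is a hypothetical helper
# and is not part of the arrow test suite.
order_is_preserved <- function(n_groups = 15, rows_per_group = 1e5) {
  has_pkgs <- requireNamespace("arrow", quietly = TRUE) &&
    requireNamespace("dplyr", quietly = TRUE)
  if (!has_pkgs ||
      !"preserve_order" %in% names(formals(arrow::write_dataset))) {
    return(NA) # argument not available in this installation
  }

  # Same shape as the reprex: x is the original row index, g is the partition.
  df <- data.frame(
    x = seq_len(n_groups * rows_per_group),
    g = rep(seq_len(n_groups), each = rows_per_group)
  )
  dst <- tempfile()
  on.exit(unlink(dst, recursive = TRUE), add = TRUE)

  arrow::write_dataset(df, dst, partitioning = "g", preserve_order = TRUE)
  out <- dplyr::collect(arrow::open_dataset(dst))

  # Partitions may be read back in any order, so restore the group order with
  # a stable sort; rows within each group keep the order they were read in.
  out <- out[order(out$g, method = "radix"), ]
  identical(out$x, df$x)
}
```

In a testthat file this could be wrapped as expect_true(order_is_preserved()), with a skip when the helper returns NA. The opposite check for preserve_order = FALSE is left out on purpose, for the platform-dependent reasons described above.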

conbench-apache-arrow commented

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 376afb8.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
